Marks: 60
The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: they can help fight inflation, build wealth, and even provide some tax benefits. Steady returns compounded over a long period can grow far more than seems possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus available at retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximize earnings under any market condition. A diversified portfolio tends to yield higher returns at lower risk by tempering potential losses in a market downturn. It is easy to get lost in the sea of financial metrics used to value a stock, and repeating that analysis across a multitude of stocks to identify the right picks for an individual is a tedious task. Cluster analysis can identify stocks that exhibit similar characteristics and groups that exhibit minimal correlation with one another. This helps investors analyze stocks across different market segments and guard against risks that could leave the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. Your tasks are to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv("stock_data.csv")
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Ticker Symbol                 340 non-null    object
 1   Security                      340 non-null    object
 2   GICS Sector                   340 non-null    object
 3   GICS Sub Industry             340 non-null    object
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64
 8   Cash Ratio                    340 non-null    int64
 9   Net Cash Flow                 340 non-null    int64
 10  Net Income                    340 non-null    int64
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
The dataset has 340 rows and 15 columns. The first 4 columns are object (string) type; since they are categorical identifiers, they will be dropped for the clustering itself. No missing values are apparent, since every column has 340 non-null entries. The remaining columns are int64 or float64, so they are suitable for cluster analysis.
data.isnull().sum()
Ticker Symbol                   0
Security                        0
GICS Sector                     0
GICS Sub Industry               0
Current Price                   0
Price Change                    0
Volatility                      0
ROE                             0
Cash Ratio                      0
Net Cash Flow                   0
Net Income                      0
Earnings Per Share              0
Estimated Shares Outstanding    0
P/E Ratio                       0
P/B Ratio                       0
dtype: int64
No missing values in the dataset.
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Current Price | 340.0 | 8.086234e+01 | 9.805509e+01 | 4.500000e+00 | 3.855500e+01 | 5.970500e+01 | 9.288000e+01 | 1.274950e+03 |
| Price Change | 340.0 | 4.078194e+00 | 1.200634e+01 | -4.712969e+01 | -9.394838e-01 | 4.819505e+00 | 1.069549e+01 | 5.505168e+01 |
| Volatility | 340.0 | 1.525976e+00 | 5.917984e-01 | 7.331632e-01 | 1.134878e+00 | 1.385593e+00 | 1.695549e+00 | 4.580042e+00 |
| ROE | 340.0 | 3.959706e+01 | 9.654754e+01 | 1.000000e+00 | 9.750000e+00 | 1.500000e+01 | 2.700000e+01 | 9.170000e+02 |
| Cash Ratio | 340.0 | 7.002353e+01 | 9.042133e+01 | 0.000000e+00 | 1.800000e+01 | 4.700000e+01 | 9.900000e+01 | 9.580000e+02 |
| Net Cash Flow | 340.0 | 5.553762e+07 | 1.946365e+09 | -1.120800e+10 | -1.939065e+08 | 2.098000e+06 | 1.698108e+08 | 2.076400e+10 |
| Net Income | 340.0 | 1.494385e+09 | 3.940150e+09 | -2.352800e+10 | 3.523012e+08 | 7.073360e+08 | 1.899000e+09 | 2.444200e+10 |
| Earnings Per Share | 340.0 | 2.776662e+00 | 6.587779e+00 | -6.120000e+01 | 1.557500e+00 | 2.895000e+00 | 4.620000e+00 | 5.009000e+01 |
| Estimated Shares Outstanding | 340.0 | 5.770283e+08 | 8.458496e+08 | 2.767216e+07 | 1.588482e+08 | 3.096751e+08 | 5.731175e+08 | 6.159292e+09 |
| P/E Ratio | 340.0 | 3.261256e+01 | 4.434873e+01 | 2.935451e+00 | 1.504465e+01 | 2.081988e+01 | 3.176476e+01 | 5.280391e+02 |
| P/B Ratio | 340.0 | -1.718249e+00 | 1.396691e+01 | -7.611908e+01 | -4.352056e+00 | -1.067170e+00 | 3.917066e+00 | 1.290646e+02 |
There is a wide range in the magnitude of values across the columns. Prior to analysis, the data will be scaled and problematic outliers treated to ensure the best results.
data.head()
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
df = data.copy()
df.duplicated().sum()
0
No duplicated rows exist in the dataframe.
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
df: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=df, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
if bins:
    sns.histplot(data=df, x=feature, kde=kde, ax=ax_hist2, bins=bins)
else:
    sns.histplot(data=df, x=feature, kde=kde, ax=ax_hist2)  # For histogram
ax_hist2.axvline(
df[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
df[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# selecting numerical columns
num_cols = data.select_dtypes(include=np.number).columns.tolist()
for item in num_cols:
histogram_boxplot(data, item, bins=50, kde=True, figsize=(10, 5))
plt.figure(figsize=(15, 7))
sns.heatmap(
data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
No correlation in the data is higher than 0.60 or lower than -0.40, so there are no strong correlations in the dataset. This is good: it indicates the input variables are largely independent of each other, which aids clustering.
The highest correlations are between Net Income and Estimated Shares Outstanding (0.59) and between Net Income and Earnings Per Share (0.56).
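The strongest pairs can also be read off programmatically rather than eyeballed from the heatmap. A minimal sketch (the helper name `top_correlations` is mine, not part of the original notebook): unstack the upper triangle of the correlation matrix and sort by absolute value.

```python
import numpy as np
import pandas as pd

def top_correlations(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n strongest pairwise correlations (by absolute value)."""
    corr = df.select_dtypes(include=np.number).corr()
    # keep only the upper triangle so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs.reindex(pairs.abs().sort_values(ascending=False).index).head(n)

# usage on any numeric DataFrame, e.g. top_correlations(data, n=5)
```

In this notebook, `top_correlations(data)` would surface the Net Income pairs noted above.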
sns.pairplot(data=df, diag_kind="kde")
plt.show()
plt.figure(figsize=(15, 7))
ax = sns.boxplot(x="GICS Sector", y="Price Change", data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_xlabel("GICS Sector", fontsize=10)
ax.set_ylabel("Price Change", fontsize=10)
plt.show()
plt.figure(figsize=(15, 7))
ax = sns.boxplot(x="GICS Sector", y="Cash Ratio", data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_xlabel("GICS Sector", fontsize=10)
ax.set_ylabel("Cash Ratio", fontsize=10)
plt.show()
Information Technology, Financials, and Health Care have the highest median cash ratios; the remaining sectors have lower cash ratios on average.
plt.figure(figsize=(15, 7))
ax = sns.boxplot(x="GICS Sector", y="P/E Ratio", data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_xlabel("GICS Sector", fontsize=10)
ax.set_ylabel("P/E Ratio", fontsize=10)
ax.set_ylim(0, 200)
plt.show()
numerical_col = data.select_dtypes(include=np.number).columns.tolist()
plt.rc('axes', titlesize=10)
plt.rc('axes', labelsize=10)
plt.figure(figsize=(10, 10), dpi=80)
for i, variable in enumerate(numerical_col):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.title(variable)
plt.tight_layout()
plt.show()
All the columns will be floored and capped for outliers to optimize cluster modeling.
# functions to treat outliers by flooring and capping
def treat_outliers(df, col):
"""
Treats outliers in a variable
df: dataframe
col: dataframe column
"""
Q1 = df[col].quantile(0.25)  # 25th percentile
Q3 = df[col].quantile(0.75)  # 75th percentile
IQR = Q3 - Q1
Lower_Whisker = Q1 - 1.5 * IQR
Upper_Whisker = Q3 + 1.5 * IQR
# all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
# all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
return df
def treat_outliers_all(df, col_list):
"""
Treat outliers in a list of variables
df: dataframe
col_list: list of dataframe columns
"""
for c in col_list:
df = treat_outliers(df, c)
return df
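The flooring/capping rule above can be sanity-checked on toy data. This is a minimal sketch that re-implements the same IQR clip inline (the toy series is mine; the rule matches `treat_outliers`):

```python
import numpy as np
import pandas as pd

# Toy series with one clear high outlier
s = pd.Series([10, 12, 11, 13, 12, 11, 100], dtype=float)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whiskers
capped = np.clip(s, lower, upper)

assert capped.max() == upper          # the outlier is capped at the upper whisker
assert (capped[:-1] == s[:-1]).all()  # in-range values are untouched
```

Note that values are capped at the whiskers rather than dropped, so no rows are lost.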
ldata = data.copy()
cols_list = ("Current Price", 'Price Change', 'Volatility', 'ROE','Cash Ratio','Net Cash Flow',
'Net Income','Earnings Per Share','Estimated Shares Outstanding','P/E Ratio','P/B Ratio')
ldata = treat_outliers_all(ldata, cols_list)
# let's look at box plot to see if outliers have been treated or not
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numerical_col):
plt.subplot(5, 4, i + 1)
plt.boxplot(ldata[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Outliers have been treated by capping each out-of-range value at the upper or lower whisker of its column.
df1 = ldata.copy()
df1 = df1.drop(['Ticker Symbol','Security','GICS Sector','GICS Sub Industry'], axis=1)
df1.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Current Price | 340.0 | 7.033351e+01 | 4.306783e+01 | 4.500000e+00 | 3.855500e+01 | 5.970500e+01 | 9.288000e+01 | 1.743675e+02 |
| Price Change | 340.0 | 4.347886e+00 | 1.030507e+01 | -1.839195e+01 | -9.394838e-01 | 4.819505e+00 | 1.069549e+01 | 2.814796e+01 |
| Volatility | 340.0 | 1.482875e+00 | 4.564288e-01 | 7.331632e-01 | 1.134878e+00 | 1.385593e+00 | 1.695549e+00 | 2.536555e+00 |
| ROE | 340.0 | 2.051801e+01 | 1.530811e+01 | 1.000000e+00 | 9.750000e+00 | 1.500000e+01 | 2.700000e+01 | 5.287500e+01 |
| Cash Ratio | 340.0 | 6.308824e+01 | 5.839559e+01 | 0.000000e+00 | 1.800000e+01 | 4.700000e+01 | 9.900000e+01 | 2.205000e+02 |
| Net Cash Flow | 340.0 | -4.604551e+06 | 4.186738e+08 | -7.394824e+08 | -1.939065e+08 | 2.098000e+06 | 1.698108e+08 | 7.153866e+08 |
| Net Income | 340.0 | 1.197628e+09 | 1.530431e+09 | -1.967747e+09 | 3.523012e+08 | 7.073360e+08 | 1.899000e+09 | 4.219048e+09 |
| Earnings Per Share | 340.0 | 3.098004e+00 | 2.896002e+00 | -3.036250e+00 | 1.557500e+00 | 2.895000e+00 | 4.620000e+00 | 9.213750e+00 |
| Estimated Shares Outstanding | 340.0 | 4.386644e+08 | 3.562553e+08 | 2.767216e+07 | 1.588482e+08 | 3.096751e+08 | 5.731175e+08 | 1.194521e+09 |
| P/E Ratio | 340.0 | 2.556540e+01 | 1.464714e+01 | 2.935451e+00 | 1.504465e+01 | 2.081988e+01 | 3.176476e+01 | 5.684491e+01 |
| P/B Ratio | 340.0 | -1.220577e+00 | 7.594997e+00 | -1.675574e+01 | -4.352056e+00 | -1.067170e+00 | 3.917066e+00 | 1.632075e+01 |
| HC_Clusters | 340.0 | 2.332353e+00 | 2.708799e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 4.000000e+00 | 8.000000e+00 |
scaler = StandardScaler()
subset = df1.copy()
subset_scaled = scaler.fit_transform(subset)
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
subset_scaled_df_h = subset_scaled_df.copy()
subset_scaled_df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Current Price | 340.0 | -1.567374e-17 | 1.001474 | -1.530854 | -0.738959 | -0.247149 | 0.524283 | 2.419145 |
| Price Change | 340.0 | -5.485808e-17 | 1.001474 | -2.209917 | -0.513841 | 0.045833 | 0.616877 | 2.312954 |
| Volatility | 340.0 | -4.862124e-16 | 1.001474 | -1.644982 | -0.763558 | -0.213452 | 0.466638 | 2.311932 |
| ROE | 340.0 | -9.469549e-17 | 1.001474 | -1.276891 | -0.704456 | -0.360995 | 0.424059 | 2.116831 |
| Cash Ratio | 340.0 | 2.024524e-17 | 1.001474 | -1.081952 | -0.773255 | -0.275910 | 0.615880 | 2.699583 |
| Net Cash Flow | 340.0 | -3.689859e-17 | 1.001474 | -1.757838 | -0.452813 | 0.016033 | 0.417204 | 1.722229 |
| Net Income | 340.0 | -1.012262e-17 | 1.001474 | -2.071339 | -0.553160 | -0.320835 | 0.458959 | 1.977138 |
| Earnings Per Share | 340.0 | -2.253100e-17 | 1.001474 | -2.121302 | -0.532726 | -0.070201 | 0.526325 | 2.114902 |
| Estimated Shares Outstanding | 340.0 | 8.449124e-17 | 1.001474 | -1.155346 | -0.786595 | -0.362603 | 0.377963 | 2.124799 |
| P/E Ratio | 340.0 | -3.382915e-16 | 1.001474 | -1.547285 | -0.719339 | -0.324468 | 0.423870 | 2.138684 |
| P/B Ratio | 340.0 | 6.465416e-17 | 1.001474 | -2.048462 | -0.412916 | 0.020228 | 0.677448 | 2.312994 |
| HC_Clusters | 340.0 | -5.877651e-17 | 1.001474 | -0.862297 | -0.862297 | -0.492586 | 0.616548 | 2.095393 |
The scaled data now has all variables on a similar scale, and the outliers have been treated.
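A quick invariant check on toy data: after z-scoring, every column should have mean ≈ 0 and standard deviation ≈ 1 (StandardScaler uses the population standard deviation, matching NumPy's default ddof=0). Incidentally, the std of 1.001474 reported by describe above is the sample std (ddof=1): with n = 340, sqrt(340/339) ≈ 1.001474.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))  # toy data on an arbitrary scale
Xs = StandardScaler().fit_transform(X)

# each column now has mean ~0 and (population) std ~1
assert np.allclose(Xs.mean(axis=0), 0, atol=1e-12)
assert np.allclose(Xs.std(axis=0), 1)
```

The same check applied to `subset_scaled_df` would confirm the scaling before clustering.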
# selecting numerical columns
num_cols = subset_scaled_df.columns.tolist()
for item in num_cols:
histogram_boxplot(subset_scaled_df, item, bins=50, kde=True, figsize=(10, 5))
Outliers have been capped and all variables are now scaled to a similar magnitude. The distributions are not all normal, but most are approximately so.
# Finding the optimal number of clusters with the elbow method
clusters = range(1, 10)
meanDistortions = []
for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(subset_scaled_df)
    prediction = model.predict(subset_scaled_df)
    meanDistortions.append(
        sum(np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1))
        / subset_scaled_df.shape[0]
    )
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
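KMeans already exposes the within-cluster sum of squared distances as `inertia_`, so an equivalent elbow curve can be produced without computing distances by hand. A sketch (the helper name `elbow_inertias` is mine; the curve's shape, not its scale, drives the choice of k):

```python
from sklearn.cluster import KMeans

def elbow_inertias(X, k_range=range(1, 10), seed=1):
    """Fit KMeans for each k and collect the within-cluster sum of squares."""
    inertias = []
    for k in k_range:
        km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
        inertias.append(km.inertia_)
    return inertias

# e.g. plt.plot(range(1, 10), elbow_inertias(subset_scaled_df))
```

Alternatively, the `KElbowVisualizer` imported earlier wraps this whole procedure, including marking the elbow.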
sil_score = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    preds = clusterer.fit_predict(subset_scaled_df)
    score = silhouette_score(subset_scaled_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
For n_clusters = 2, silhouette score is 0.23184954163256857
For n_clusters = 3, silhouette score is 0.1820225584326117
For n_clusters = 4, silhouette score is 0.1962197695163508
For n_clusters = 5, silhouette score is 0.20803275784490818
For n_clusters = 6, silhouette score is 0.21439911948692938
For n_clusters = 7, silhouette score is 0.19680180764118985
For n_clusters = 8, silhouette score is 0.1380954111351497
For n_clusters = 9, silhouette score is 0.14153692299213602
[<matplotlib.lines.Line2D at 0x256ddfd4fa0>]
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
A choice of 4 clusters gives a good silhouette score and corresponds to a reasonable elbow in the distortion curve.
# let's take 4 as number of clusters
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(subset_scaled_df)
KMeans(n_clusters=4, random_state=0)
df["K_means_segments"] = kmeans.labels_
subset_scaled_df["K_means_segments"] = kmeans.labels_
# Cluster profiling (numeric_only=True is needed because df still contains the string identifier columns)
cluster_profile = df.groupby("K_means_segments").mean(numeric_only=True)
cluster_profile["count_in_each_segments"] = (
    df.groupby("K_means_segments")["Current Price"].count().values)
# let's display cluster profiles
cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| K_means_segments | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58.379608 | 3.436121 | 1.382656 | 36.670886 | 43.645570 | -45673867.088608 | 749855797.468354 | 2.838259 | 270914117.312342 | 20.171557 | -4.169507 | 158 |
| 1 | 31.845644 | -10.239410 | 2.591934 | 80.243243 | 82.648649 | 33563081.081081 | -2910723972.972973 | -7.044054 | 695068471.552433 | 73.160991 | 2.114067 | 37 |
| 2 | 169.416713 | 10.038171 | 1.545044 | 24.700000 | 103.842857 | 5873342.857143 | 860247185.714286 | 5.342429 | 256763623.596857 | 55.097920 | 2.064259 | 70 |
| 3 | 69.756804 | 6.931533 | 1.284234 | 39.613333 | 87.800000 | 325950586.666667 | 5827907106.666667 | 5.097067 | 1462589562.690667 | 17.831392 | -1.975217 | 75 |
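Cluster means can be pulled around by skewed variables, so a median-based profile is a useful cross-check. A minimal sketch using the same groupby pattern (the helper name `cluster_profile_median` is mine):

```python
import pandas as pd

def cluster_profile_median(df: pd.DataFrame, label_col: str) -> pd.DataFrame:
    """Median of each numeric column per cluster, plus cluster sizes."""
    prof = df.groupby(label_col).median(numeric_only=True)
    prof["count_in_each_segments"] = df.groupby(label_col).size()
    return prof

# e.g. cluster_profile_median(df, "K_means_segments")
```

If the median profile tells the same story as the mean profile, the cluster characterizations are robust to skew.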
subset_scaled_df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Current Price | 340.0 | -1.567374e-17 | 1.001474 | -1.530854 | -0.738959 | -0.247149 | 0.524283 | 2.419145 |
| Price Change | 340.0 | -5.485808e-17 | 1.001474 | -2.209917 | -0.513841 | 0.045833 | 0.616877 | 2.312954 |
| Volatility | 340.0 | -4.862124e-16 | 1.001474 | -1.644982 | -0.763558 | -0.213452 | 0.466638 | 2.311932 |
| ROE | 340.0 | -9.469549e-17 | 1.001474 | -1.276891 | -0.704456 | -0.360995 | 0.424059 | 2.116831 |
| Cash Ratio | 340.0 | 2.024524e-17 | 1.001474 | -1.081952 | -0.773255 | -0.275910 | 0.615880 | 2.699583 |
| Net Cash Flow | 340.0 | -3.689859e-17 | 1.001474 | -1.757838 | -0.452813 | 0.016033 | 0.417204 | 1.722229 |
| Net Income | 340.0 | -1.012262e-17 | 1.001474 | -2.071339 | -0.553160 | -0.320835 | 0.458959 | 1.977138 |
| Earnings Per Share | 340.0 | -2.253100e-17 | 1.001474 | -2.121302 | -0.532726 | -0.070201 | 0.526325 | 2.114902 |
| Estimated Shares Outstanding | 340.0 | 8.449124e-17 | 1.001474 | -1.155346 | -0.786595 | -0.362603 | 0.377963 | 2.124799 |
| P/E Ratio | 340.0 | -3.382915e-16 | 1.001474 | -1.547285 | -0.719339 | -0.324468 | 0.423870 | 2.138684 |
| P/B Ratio | 340.0 | 6.465416e-17 | 1.001474 | -2.048462 | -0.412916 | 0.020228 | 0.677448 | 2.312994 |
| HC_Clusters | 340.0 | -5.877651e-17 | 1.001474 | -0.862297 | -0.862297 | -0.492586 | 0.616548 | 2.095393 |
| K_means_segments | 340.0 | 1.182353e+00 | 1.234573 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 3.000000 |
| GROUP | 340.0 | 1.655882e+00 | 1.040108 | 0.000000 | 1.000000 | 2.000000 | 2.000000 | 3.000000 |
df1['K_means_segments'] = subset_scaled_df['K_means_segments']
df1.groupby(by=['K_means_segments']).sum()
| K_means_segments | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | HC_Clusters | GROUP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9215.245637 | 551.220159 | 217.747866 | 2958.625 | 6891.5 | -8934606250 | 118477216000 | 448.44500 | 4.280443e+10 | 3187.106036 | -404.353774 | 33 | 316 |
| 1 | 1178.288822 | -256.815852 | 83.286705 | 825.000 | 2492.5 | -2201846125 | -28686612125 | -50.32625 | 2.182001e+10 | 1705.422542 | 77.195360 | 183 | 37 |
| 2 | 8330.880062 | 669.041164 | 106.825492 | 1361.250 | 6656.0 | 5289795250 | 57497351125 | 301.08500 | 1.750308e+10 | 2485.402565 | 42.458935 | 429 | 210 |
| 3 | 5188.977827 | 514.835610 | 96.317513 | 1831.250 | 5410.0 | 4281109625 | 259905706125 | 354.11750 | 6.701839e+10 | 1314.306201 | -130.296726 | 148 | 0 |
data['K_means_segments'] = subset_scaled_df['K_means_segments']
data['HC_Clusters'] = subset_scaled_df['HC_Clusters']
data['GROUP'] = subset_scaled_df['GROUP']
data.head()
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | K_means_segments | HC_Clusters | GROUP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 3 | -0.122875 | 0 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 3 | -0.122875 | 0 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 | 3 | -0.492586 | 0 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 | 2 | 1.725682 | 3 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 | 1 | 0.246837 | 1 |
num_col = ldata.select_dtypes(include=np.number).columns.tolist()
fig, axes = plt.subplots(4, 3, figsize=(20, 35))
fig.suptitle("Boxplot of scaled numerical variables for each cluster", fontsize=20)
# one subplot per numeric column, filled row by row
for counter, col in enumerate(num_col):
    sns.boxplot(
        ax=axes[counter // 3][counter % 3],
        y=subset_scaled_df[col],
        x=subset_scaled_df["K_means_segments"],
    )
fig.tight_layout(pad=2.0)
num_col = df1.select_dtypes(include=np.number).columns.tolist()
fig, axes = plt.subplots(4, 3, figsize=(20, 35))
fig.suptitle("Boxplot of original-scale numerical variables for each cluster", fontsize=20)
# one subplot per numeric column, filled row by row
for counter, col in enumerate(num_col):
    sns.boxplot(
        ax=axes[counter // 3][counter % 3],
        y=data[col],
        x=data["K_means_segments"],
    )
fig.tight_layout(pad=2.0)
Cluster 1:
Cluster 2:
Cluster 3:
pd.crosstab(data.K_means_segments, data['GICS Sector']).style.highlight_max(
color="lightgreen", axis=0)
| K_means_segments \ GICS Sector | Consumer Discretionary | Consumer Staples | Energy | Financials | Health Care | Industrials | Information Technology | Materials | Real Estate | Telecommunications Services | Utilities |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28 | 9 | 0 | 27 | 10 | 34 | 7 | 10 | 11 | 1 | 21 |
| 1 | 0 | 0 | 24 | 0 | 1 | 1 | 6 | 2 | 2 | 1 | 0 |
| 2 | 8 | 2 | 1 | 4 | 21 | 3 | 10 | 7 | 14 | 0 | 0 |
| 3 | 4 | 8 | 5 | 18 | 8 | 15 | 10 | 1 | 0 | 3 | 3 |
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(subset_scaled_df_h, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(subset_scaled_df_h))
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.7115641567913893.
Cophenetic correlation for Euclidean distance and complete linkage is 0.6345330419501592.
Cophenetic correlation for Euclidean distance and average linkage is 0.7603860981510149.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.6744876679395055.
Cophenetic correlation for Chebyshev distance and single linkage is 0.7296085000603678.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.33219036821360853.
Cophenetic correlation for Chebyshev distance and average linkage is 0.7459145447489274.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.47875806258218584.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.6509548425977477.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.49973049130649655.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.666707471599979.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.586737151754734.
Cophenetic correlation for Cityblock distance and single linkage is 0.6839906681219571.
Cophenetic correlation for Cityblock distance and complete linkage is 0.6527612737553125.
Cophenetic correlation for Cityblock distance and average linkage is 0.7458299669743429.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.5341149573795416.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.7622431654960836, which is obtained with Cityblock distance and average linkage.
Restricting to Euclidean distance only:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
Z = linkage(subset_scaled_df_h, metric="euclidean", method=lm)
c, coph_dists = cophenet(Z, pdist(subset_scaled_df_h))
print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = "euclidean"
high_dm_lm[1] = lm
Cophenetic correlation for single linkage is 0.7115641567913893.
Cophenetic correlation for complete linkage is 0.6345330419501592.
Cophenetic correlation for average linkage is 0.7603860981510149.
Cophenetic correlation for centroid linkage is 0.7348640120616929.
Cophenetic correlation for ward linkage is 0.6290703444186456.
Cophenetic correlation for weighted linkage is 0.6744876679395055.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
high_cophenet_corr, high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.7603860981510149, which is obtained with average linkage.
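As a small illustration of what the cophenetic correlation measures (agreement between the raw pairwise distances and the merge heights in the dendrogram): on toy data with two tight, well-separated groups, it should come out close to 1. A sketch with made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# two tight, well-separated groups of points
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])

Z = linkage(X, metric="euclidean", method="average")
c, _ = cophenet(Z, pdist(X))
assert 0.9 < c <= 1.0  # dendrogram distances track the raw distances closely
```

A lower value, as seen with complete linkage above, means the dendrogram distorts the original distance structure more.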
Restricting to Euclidean distance, the highest cophenetic correlation is obtained with average linkage, followed closely by centroid linkage.
Dendrograms to compare linkage methods
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(subset_scaled_df_h, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df_h))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
Observations
HCmodel = AgglomerativeClustering(n_clusters=9, affinity="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df_h)
AgglomerativeClustering(n_clusters=9)
subset_scaled_df_h["HC_Clusters"] = HCmodel.labels_
data["HC_Clusters"] = HCmodel.labels_
cluster_profile = data.groupby("HC_Clusters").mean()

cluster_profile["count_in_each_segments"] = (
    data.groupby("HC_Clusters")["GICS Sector"].count().values
)
# let's see the names of the sectors in each cluster
for cl in data["HC_Clusters"].unique():
    print(
        "The",
        data[data["HC_Clusters"] == cl]["GICS Sector"].nunique(),
        "GICS Sectors in cluster",
        cl,
        "are:",
    )
    print(data[data["HC_Clusters"] == cl]["GICS Sector"].unique())
    print("-" * 100, "\n")
The 10 GICS Sectors in cluster 7 are:
['Industrials' 'Health Care' 'Financials' 'Consumer Staples' 'Utilities' 'Consumer Discretionary' 'Information Technology' 'Telecommunications Services' 'Materials' 'Energy']
----------------------------------------------------------------------------------------------------

The 6 GICS Sectors in cluster 8 are:
['Health Care' 'Financials' 'Consumer Discretionary' 'Information Technology' 'Consumer Staples' 'Industrials']
----------------------------------------------------------------------------------------------------

The 7 GICS Sectors in cluster 2 are:
['Information Technology' 'Health Care' 'Consumer Discretionary' 'Real Estate' 'Industrials' 'Financials' 'Consumer Staples']
----------------------------------------------------------------------------------------------------

The 9 GICS Sectors in cluster 0 are:
['Information Technology' 'Financials' 'Industrials' 'Health Care' 'Energy' 'Materials' 'Telecommunications Services' 'Real Estate' 'Consumer Staples']
----------------------------------------------------------------------------------------------------

The 9 GICS Sectors in cluster 3 are:
['Consumer Staples' 'Financials' 'Information Technology' 'Health Care' 'Consumer Discretionary' 'Materials' 'Industrials' 'Utilities' 'Real Estate']
----------------------------------------------------------------------------------------------------

The 8 GICS Sectors in cluster 6 are:
['Information Technology' 'Industrials' 'Financials' 'Health Care' 'Real Estate' 'Materials' 'Consumer Discretionary' 'Energy']
----------------------------------------------------------------------------------------------------

The 10 GICS Sectors in cluster 1 are:
['Utilities' 'Real Estate' 'Materials' 'Industrials' 'Consumer Discretionary' 'Financials' 'Consumer Staples' 'Telecommunications Services' 'Health Care' 'Information Technology']
----------------------------------------------------------------------------------------------------

The 8 GICS Sectors in cluster 5 are:
['Financials' 'Real Estate' 'Health Care' 'Information Technology' 'Energy' 'Consumer Discretionary' 'Consumer Staples' 'Materials']
----------------------------------------------------------------------------------------------------

The 3 GICS Sectors in cluster 4 are:
['Energy' 'Materials' 'Information Technology']
----------------------------------------------------------------------------------------------------
The 9-cluster solution distributes the GICS Sectors well, with most clusters spanning several sectors.
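As a cross-check on a 9-cluster cut (a sketch, not part of the original notebook; the data here is synthetic), the same flat clusters can be obtained by cutting the linkage matrix directly with scipy's `fcluster` instead of refitting `AgglomerativeClustering`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# synthetic stand-in for the scaled feature matrix used in the notebook
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))

# build the Ward linkage and cut the tree into 9 flat clusters
Z = linkage(X, method="ward", metric="euclidean")
labels = fcluster(Z, t=9, criterion="maxclust")

print(sorted(set(labels)))  # cluster ids 1..9
```

With continuous data the merge heights are distinct, so `criterion="maxclust"` yields exactly 9 clusters.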
# let's display the cluster profile
cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| HC_Clusters | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52.274599 | 4.020981 | 1.363819 | 18.511752 | 43.017094 | -106017976.495726 | 689909333.333333 | 2.650000 | 282470678.092906 | 20.177471 | -0.576174 | 117 |
| 1 | 81.025904 | 4.878301 | 1.408869 | 15.356383 | 37.148936 | 29035061.170213 | 683162617.021277 | 2.505745 | 239421449.873192 | 32.914241 | -9.642821 | 47 |
| 2 | 54.700478 | 7.806808 | 1.442866 | 13.484848 | 63.242424 | -532798943.181818 | 2612261534.090909 | 3.032538 | 996691306.916288 | 22.631248 | -0.796261 | 33 |
| 3 | 128.458697 | 14.993020 | 1.556762 | 22.282609 | 150.543478 | 129949934.782609 | 704940701.086957 | 3.454946 | 250771843.461739 | 38.691710 | 6.029350 | 23 |
| 4 | 145.321856 | 4.828959 | 1.445543 | 25.887097 | 56.403226 | -6625068.548387 | 1403766161.290323 | 7.693952 | 163163361.012903 | 21.491745 | -0.879290 | 31 |
| 5 | 84.614231 | 6.663388 | 1.238605 | 39.442308 | 64.480769 | 261761759.615385 | 4083769639.423077 | 6.375240 | 837787187.915385 | 14.016251 | -2.475859 | 26 |
| 6 | 37.447733 | 5.870416 | 1.507650 | 14.440000 | 79.540000 | 601491440.000000 | 2466883120.000000 | 2.676600 | 831983486.050500 | 17.959717 | -1.584191 | 25 |
| 7 | 32.947827 | -11.234540 | 2.393770 | 31.478261 | 43.347826 | -136632809.782609 | -1356654576.086957 | -2.242609 | 473814086.308804 | 50.100914 | 0.689859 | 23 |
| 8 | 55.362667 | -2.348649 | 1.680686 | 14.533333 | 180.733333 | 371484300.000000 | 785571666.666667 | 1.757333 | 509841246.581667 | 34.385524 | 7.239907 | 15 |
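The profile above is built from separate `mean()` and `count()` calls; the same table can be produced in a single pass with pandas named aggregation. A sketch on a toy frame (column names mirror the notebook's, values are made up):

```python
import pandas as pd

# toy frame standing in for `data`; values are illustrative
df = pd.DataFrame({
    "HC_Clusters": [0, 0, 1, 1, 1],
    "ROE": [10.0, 20.0, 30.0, 40.0, 50.0],
    "GICS Sector": ["Energy", "Utilities", "Energy", "Materials", "Energy"],
})

# mean of a numeric column per cluster plus the member count, in one groupby
profile = df.groupby("HC_Clusters").agg(
    ROE=("ROE", "mean"),
    count_in_each_segments=("GICS Sector", "count"),
)
print(profile)
```

Named aggregation avoids attaching the counts afterwards via `.values`, which silently relies on the two groupbys producing the same cluster order.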
fig, axes = plt.subplots(4, 3, figsize=(20, 35))
fig.suptitle("Boxplot of scaled numerical variables for each cluster", fontsize=20)

# plot one boxplot per numerical column, filling the 4x3 grid row by row
for i, col in enumerate(num_cols):
    sns.boxplot(
        ax=axes[i // 3][i % 3],
        y=subset_scaled_df_h[col],
        x=subset_scaled_df_h["HC_Clusters"],
    )

# remove the unused subplot (11 variables in a 12-slot grid)
fig.delaxes(axes[3][2])

fig.tight_layout(pad=2.0)
fig, axes = plt.subplots(4, 3, figsize=(20, 35))
fig.suptitle("Boxplot of original numerical variables for each cluster", fontsize=20)

# plot one boxplot per numerical column, filling the 4x3 grid row by row
for i, col in enumerate(num_cols):
    sns.boxplot(ax=axes[i // 3][i % 3], y=df1[col], x=df1["HC_Clusters"])

# remove the unused subplot (11 variables in a 12-slot grid)
fig.delaxes(axes[3][2])

fig.tight_layout(pad=2.0)
Dimensionality reduction with PCA
# importing library
from sklearn.decomposition import PCA
# setting the number of components to 2
pca = PCA(n_components=2)
# transforming data and storing results in a dataframe
X_reduced_pca = pca.fit_transform(subset_scaled_df_h)
reduced_df_pca = pd.DataFrame(
data=X_reduced_pca, columns=["Component 1", "Component 2"]
)
# proportion of the total variance explained by the two components
pca.explained_variance_ratio_.sum()
0.549433097391417
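So the first two components retain about 55% of the total variance. As a sketch of where `explained_variance_ratio_` comes from (synthetic data here, not the notebook's), the ratios can be reproduced from the singular values of the centered data:

```python
import numpy as np

# synthetic stand-in for the scaled feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# center the data, then take the SVD; squared singular values are
# proportional to the variance captured by each principal component
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
ratios = s**2 / np.sum(s**2)

print(ratios[:2].sum())  # variance explained by the first 2 components
```

The same cumulative sum can guide the choice of `n_components`: pick the smallest number of components whose ratios sum past a target threshold.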
sns.set(font_scale=4)
sns.scatterplot(data=reduced_df_pca, x="Component 1", y="Component 2")
<AxesSubplot:xlabel='Component 1', ylabel='Component 2'>
sns.set(font_scale=4)

sns.scatterplot(
    s=250,
    data=reduced_df_pca,
    x="Component 1",
    y="Component 2",
    hue=df1["HC_Clusters"],
    palette="muted",
)
plt.legend(bbox_to_anchor=(1, 1))
<matplotlib.legend.Legend at 0x256e1567c70>
The clusters appear well separated in the PCA projection, with some minor overlap between Clusters 3 and 4. Nine clusters therefore seem a good fit for this data.
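The visual impression of separation can also be quantified with a silhouette score (a sketch on synthetic blobs, not part of the original analysis; scores near 1 indicate compact, well-separated clusters):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# two well-separated synthetic blobs standing in for the scaled stock data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])

labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
score = silhouette_score(X, labels)
print(round(score, 2))  # close to 1 for well-separated clusters
```

In the notebook, the same call on `subset_scaled_df_h` and `HCmodel.labels_` would give a single number to compare against alternative cluster counts.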
Hierarchical clustering yielded more clusters, with greater sector diversity within each cluster.
By reviewing the clusters, an investor could build a diversified portfolio from a selection of stocks drawn from each cluster, helping offset adverse moves in any one sector and suggesting which sectors to combine.
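One simple way to operationalize this selection (a sketch; the tickers and feature names below are made up): from each cluster, pick the stock closest to the cluster centroid in the scaled feature space, yielding one representative per cluster.

```python
import numpy as np
import pandas as pd

# toy scaled features with cluster labels; tickers are hypothetical
df = pd.DataFrame({
    "f1": [0.0, 0.5, 0.1, 5.0, 5.1, 4.9],
    "f2": [0.0, 0.1, 0.6, 5.0, 4.8, 5.2],
    "HC_Clusters": [0, 0, 0, 1, 1, 1],
}, index=["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"])

picks = {}
for cl, grp in df.groupby("HC_Clusters"):
    feats = grp[["f1", "f2"]]
    centroid = feats.mean()
    # ticker with the smallest euclidean distance to the cluster centroid
    dists = np.sqrt(((feats - centroid) ** 2).sum(axis=1))
    picks[cl] = dists.idxmin()

print(picks)  # one representative ticker per cluster
```

The same loop run on the notebook's scaled data with `HC_Clusters` labels would shortlist one "typical" stock per cluster as a starting point for a diversified portfolio.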